In [19]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
In [41]:
from Bio import SeqIO
from sklearn.cluster import AffinityPropagation, KMeans, DBSCAN
from sklearn.manifold import MDS, t_sne, Isomap
from sklearn.metrics import silhouette_score
from util.isoelectric_point import isoelectric_points
from IPython.display import Image
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_style('white')
sns.set_context('notebook')
By the end of this notebook, you will learn that:
model.fit(X)
and model.predict(X)
coding pattern.If I'm given a list of samples, how do I figure out what samples are similar to one another?
Here, we will explore this problem by using biological sequence data.
The data here are 1000 influenza H1 sequences. Let's see if we can find any interesting structure in this dataset.
Please uncomment the code the following 3 cells, and run them.
In [21]:
# Load the sequence data into memory.
# BioPython is a great Python package for reading and manipulating biological sequence data.
# Remember, though, that the data have to ultimately become a 2D matrix.
sequences = [s for s in SeqIO.parse('data/human_h1_aligned_downsampled.fasta', 'fasta')]
sequences[0:5]
Out[21]:
In [22]:
# Get a list of the sequence identifiers.
seqids = [s.id for s in SeqIO.parse('data/human_h1_aligned_downsampled.fasta', 'fasta')]
seqids[0:5]
Out[22]:
An important part of using machine learning algorithms is that your data have to be numeric. Strings and special values (inf
or NaN
) are not allowed in many of them.
With the sequence data, what could we replace the letters with?
In [25]:
# Make the data into a pandas dataframe, which is an acceptable data input for scikit-learn.
seq_df = pd.DataFrame(sequences)
seq_df.index = seqids
seq_df = seq_df.replace(isoelectric_points.keys(), isoelectric_points.values()).replace('-', 0)
seq_df.head()
Out[25]:
In [40]:
# Let's try
ap = AffinityPropagation()
ap.fit(seq_df) # trains the clustering algorithm on the seq_df.
ap_cluster_labels = ap.predict(seq_df)
In [28]:
# Use multi-dimensional scaling (MDS) to get a 2-D plane representation of the data, so that we can plot it.
mds = MDS()
mds_coords = mds.fit_transform(seq_df)
mds_coords
Out[28]:
In [29]:
# Let's plot this using matplotlib
plt.scatter(mds_coords[:,0], mds_coords[:,1])
Out[29]:
In [31]:
# Colour the clusters by cluster number, then plot on MDS plot.
cmap = plt.cm.get_cmap('gnuplot2', len(ap_cluster_labels))
plt.scatter(mds_coords[:,0], mds_coords[:,1], c=ap_cluster_labels, cmap=cmap)
Out[31]:
One metric for evaluating clustering results, which is provided by the scikit-learn
API, is the silhouette_score
.
The definition of the silhouette_score
is $ \dfrac{b - a}{max(a, b)} $. The score can take on values between -1 and 1, with -1 being the worst, and +1 being the best scores. 0 indicates overlapping clusters.
From the scikit-learn
API documentation:
The Silhouette Coefficient is calculated using the mean intra-cluster distance
a
and the mean nearest-cluster distanceb
for each sample. The Silhouette Coefficient for a sample is(b - a) / max(a, b)
. To clarify,b
is the distance between a sample and the nearest cluster that the sample is not a part of. Note that Silhouette Coefficent is only defined if number of labels is 2 <= n_labels <= n_samples - 1.
Chalkboard example: three blobs
In [32]:
silhouette_score(seq_df, ap_cluster_labels)
Out[32]:
In [33]:
km = KMeans(n_clusters=2)
km_cluster_labels = km.fit_predict(seq_df)
km_cluster_labels[0:5]
Out[33]:
In [34]:
cmap = plt.cm.get_cmap('gnuplot2', len(km_cluster_labels))
plt.scatter(mds_coords[:,0], mds_coords[:,1], c=km_cluster_labels, cmap=cmap)
Out[34]:
In [35]:
silhouette_score(seq_df, km_cluster_labels)
Out[35]:
For comparison, try the DBSCAN algorithm, and do the same thing.
In [36]:
dbs = DBSCAN()
dbs_cluster_labels = dbs.fit_predict(seq_df)
dbs_cluster_labels[0:5]
Out[36]:
In [37]:
cmap = plt.cm.get_cmap('gnuplot2', len(dbs_cluster_labels))
plt.scatter(mds_coords[:,0], mds_coords[:,1], c=dbs_cluster_labels, cmap=cmap)
Out[37]:
In [38]:
# What are the silhouette scores for each of the 3 algorithms?
silhouette_score(seq_df, dbs_cluster_labels)
Out[38]:
Try one other clustering algorithm, listed here, and evaluate its performance.
In [ ]:
In [43]:
Image('http://scikit-learn.org/stable/_images/plot_cluster_comparison_001.png')
Out[43]:
clusterer.fit_predict(data)
manifold.fit_transform(data)
sklearn.metrics
module.
In [ ]: